Feature Selection For Gene Selection And Prediction

Author

  • Art Barnes
Abstract

In many machine learning applications, feature selection is required to obtain good classification performance. Selecting a good feature subset is critical when the sample size is small compared with the dimensionality and the noise in the observations. In that case the number of features must be reduced to avoid modeling noise in the classifier. When the number of features becomes large compared with the class separation, the curse of dimensionality sets in: many decision boundaries give zero error on the training set but do not generalize well [3]. By contrast, too few features or a poor subset may not allow the classes to be well separated in feature space.

Bayesian variable selection using Gibbs sampling works by numerically maximizing the posterior likelihood of a classifier model. This is done by simulating draws from the likelihood density function with a Gibbs sampler, a technique for drawing from a probability density function (PDF) when the form of the PDF may not be computable analytically [2].

Several issues arise in feature selection, one of the most important being the choice of the number of variables. For a given classification problem with a fixed number of training examples, the classifier error decreases as features are added until it reaches a trough, and then increases again, as shown in Figure 2 [3], [8]. For this reason one wants a means of automatically determining the optimal number of variables to use for classification. A related issue is the ranking of features. A ranking makes it possible to set the number of features by hand: features are ordered by usefulness, and the ordering is used to choose a cutoff. Last, features may be useful in conjunction with each other, or may be redundant. In designing a classifier, one wishes to discard the redundant features while exploiting relationships among features that do not yield good classification performance when taken on their own.
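As a rough illustration of the Gibbs sampling idea mentioned in the abstract, the sketch below draws from a joint density by sampling each variable in turn from its conditional distribution given the other. It is not the paper's variable selection procedure. The target (a standard bivariate normal with correlation `rho`), the function names, and the `burn_in` and `seed` parameters are all assumptions chosen for the example because the conditionals are known in closed form.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Toy target chosen for illustration: x | y ~ N(rho * y, 1 - rho^2)
    and y | x ~ N(rho * x, 1 - rho^2).
    """
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho * rho)  # std. dev. of each conditional
    x, y = 0.0, 0.0                       # arbitrary starting point
    samples = []
    for i in range(burn_in + n_samples):
        x = rng.gauss(rho * y, cond_sd)   # draw x given the current y
        y = rng.gauss(rho * x, cond_sd)   # draw y given the new x
        if i >= burn_in:                  # discard the early, unconverged draws
            samples.append((x, y))
    return samples


def sample_correlation(samples):
    """Empirical correlation of the (x, y) pairs."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples) / n
    var_y = sum((y - mean_y) ** 2 for _, y in samples) / n
    return cov / math.sqrt(var_x * var_y)


draws = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
estimated_rho = sample_correlation(draws)
print(f"estimated correlation: {estimated_rho:.3f}")
```

With enough draws the empirical correlation should come out close to the chosen `rho`, which is a simple check that the chain is sampling the intended joint density. In a variable selection setting the sampled quantities would instead include indicators for which features are in the model, a detail this sketch does not attempt to reproduce.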

Similar resources

Gene Identification from Microarray Data for Diagnosis of Acute Myeloid and Lymphoblastic Leukemia Using a Sparse Gene Selection Method

Background: Microarray experiments can simultaneously determine the expression of thousands of genes. Identification of potential genes from microarray data for diagnosis of cancer is important. This study aimed to identify genes for the diagnosis of acute myeloid and lymphoblastic leukemia using a sparse feature selection method. Materials and Methods: In this descriptive study, the expressio...

Neuro-Fuzzy Based Algorithm for Online Dynamic Voltage Stability Status Prediction Using Wide-Area Phasor Measurements

In this paper, a novel neuro-fuzzy based method combined with a feature selection technique is proposed for online dynamic voltage stability status prediction of power systems. This technique uses synchronized phasors measured by phasor measurement units (PMUs) in a wide-area measurement system. In order to minimize the number of neuro-fuzzy inputs, training time and complication of neuro-fuzzy ...

Feature Selection for Small Sample Sets with High Dimensional Data Using Heuristic Hybrid Approach

Feature selection can be decisive when analyzing high dimensional data, especially with a small number of samples. Feature extraction methods do not perform well in these conditions. With small sample sets and high dimensional data, exploring a large search space and learning from insufficient samples become extremely hard. As a result, neural networks and clustering a...

A New Hybrid Method for Improving the Performance of Myocardial Infarction Prediction

Abstract Introduction: Myocardial Infarction, also known as heart attack, normally occurs due to such causes as smoking, family history, diabetes, and so on. It is recognized as one of the leading causes of death in the world. Therefore, the present study aimed to evaluate the performance of classification models in order to predict Myocardial Infarction, using a feature selection method tha...

Prediction of blood cancer using leukemia gene expression data and sparsity-based gene selection methods

Background: DNA microarray is a useful technology that simultaneously assesses the expression of thousands of genes. It can be utilized for the detection of cancer types and cancer biomarkers. This study aimed to predict blood cancer using leukemia gene expression data and a robust ℓ2,p-norm sparsity-based gene selection method. Materials and Methods: In this descriptive study, the microarray ...

Feature Selection and Classification of Microarray Gene Expression Data of Ovarian Carcinoma Patients using Weighted Voting Support Vector Machine

DNA microarray gene expression gives access to a wealth of information spanning thousands of variables (genes). Analysis of this information can reveal the genetic causes of disease and differences between tumors. In this study we try to reduce high-dimensional data by a statistical method to select valuable, high-impact genes as biomarkers and then classify ovarian tumors based on gene expression data of...

Publication date: 2005